Text Preparation through Extended Tokenization
Authors
Abstract
Tokenization is commonly understood as the first step of any kind of natural language text preparation. The major goal of this early (pre-linguistic) task is to convert a stream of characters into a stream of processing units called tokens. Outside the text mining community this job is taken for granted: it is commonly seen as an already solved problem, comprising the identification of word borders and punctuation marks separated by spaces and line breaks. In our view, however, tokenization should also manage language-related word dependencies, incorporate domain-specific knowledge, and handle morphosyntactically relevant linguistic specificities. We therefore propose rule-based Extended Tokenization, which incorporates various kinds of linguistic knowledge (e.g., grammar rules, dictionaries). The core features of our implementation are the identification and disambiguation of all kinds of linguistic markers, the detection and expansion of abbreviations, the treatment of special formats, and the typing of tokens, including single- and multi-tokens. To improve the quality of text mining, we suggest linguistically based tokenization as a necessary step preceding further text processing tasks. In this paper, we focus on improving the quality of standard tagging.
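To make the idea concrete, here is a minimal, hypothetical sketch of such a rule-based tokenizer in Python. The resources (ABBREVIATIONS, MULTI_TOKENS), the Token type, and the three rules are illustrative assumptions for this note, not the authors' actual implementation; a real system would draw on far richer grammar rules and lexicons.

```python
import re
from dataclasses import dataclass

# Hypothetical resources: abbreviation expansions and multi-word units
# that should be kept together as single multi-tokens.
ABBREVIATIONS = {"e.g.": "for example", "i.e.": "that is", "Dr.": "Doctor"}
MULTI_TOKENS = {("New", "York"), ("text", "mining")}

@dataclass
class Token:
    text: str
    type: str  # WORD, NUMBER, ABBREV, MULTI, or PUNCT

def split_punct(word: str) -> tuple[str, str]:
    """Separate a word from any trailing punctuation marks."""
    core = word.rstrip(".,;:!?")
    return core, word[len(core):]

def extended_tokenize(text: str) -> list[Token]:
    raw = text.split()
    tokens: list[Token] = []
    i = 0
    while i < len(raw):
        word = raw[i]
        core, punct = split_punct(word)
        # Rule 1: recognise known abbreviations with their period intact
        # (disambiguating the dot from a sentence boundary) and expand them.
        if word in ABBREVIATIONS:
            tokens.append(Token(ABBREVIATIONS[word], "ABBREV"))
            i += 1
            continue
        # Rule 2: merge dictionary-listed multi-word units into one multi-token.
        if i + 1 < len(raw):
            nxt, nxt_punct = split_punct(raw[i + 1])
            if (core, nxt) in MULTI_TOKENS:
                tokens.append(Token(f"{core} {nxt}", "MULTI"))
                tokens.extend(Token(ch, "PUNCT") for ch in nxt_punct)
                i += 2
                continue
        # Rule 3: type special formats such as plain or decimal numbers.
        if re.fullmatch(r"\d+([.,]\d+)?", core):
            tokens.append(Token(core, "NUMBER"))
        elif core:
            tokens.append(Token(core, "WORD"))
        tokens.extend(Token(ch, "PUNCT") for ch in punct)
        i += 1
    return tokens

if __name__ == "__main__":
    for tok in extended_tokenize("Dr. Smith studies text mining, e.g. in New York."):
        print(f"{tok.type:6} {tok.text!r}")
```

On the sample sentence this yields, among others, the expanded abbreviation token "Doctor", the multi-tokens "text mining" and "New York", and separate PUNCT tokens for the detached comma and period, illustrating the abbreviation expansion, multi-token typing, and marker disambiguation features named in the abstract.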
Similar resources
Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations
The main task of tokenization is to divide the sentences of a text into their constituent units and to remove punctuation marks (periods, commas, etc.). Each unit is a continuous lexical or grammatical written chain that forms an independent semantic unit. Tokenization occurs at the word level, and the extracted units can serve as input to other components such as a stemmer. The requirement to create...
STeP-1: A Set of Fundamental Tools for Persian Text Processing
Many NLP applications need fundamental tools to convert the input text into an appropriate form or format and to extract the primary linguistic knowledge of words and sentences. These tools segment text into sentences, words, and phrases; check and correct spelling; perform lexical and morphological analysis; do POS tagging; and so on. Persian is among languages with complex prepr...
Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages
Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within...
Challenges in Urdu Text Tokenization and Sentence Boundary Disambiguation
Urdu is a morphologically rich language whose characters differ in nature. Tokenizing Urdu text and disambiguating sentence boundaries is harder than for languages like English. The major hurdle for tokenization is the improper use of space between words, whereas the absence of case discrimination makes sentence boundary detection a difficult task. In this paper some issues regarding b...
A Preliminary Look into the Use of Named Entity Information for Bioscience Text Tokenization
Tokenization in the bioscience domain is often difficult. New terms, technical terminology, and nonstandard orthography, all common in bioscience text, contribute to this difficulty. This paper introduces the tasks of tokenization and normalization before introducing BAccHANT, a system built for bioscience text normalization. Casting tokenization / normalization as a problem of punctuation cla...
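The snippet above breaks off, but it frames tokenization and normalization as a punctuation classification problem. As a purely illustrative reading of that framing, the toy classifier below decides whether each period is token-internal or a boundary; its features and rules are invented for this note and do not reflect how BAccHANT actually works.

```python
# Toy punctuation classifier: decide whether each '.' is part of a token
# (abbreviation dot, decimal point) or marks a token/sentence boundary.
# The features and rules are invented for illustration only.
def classify_period(text: str, i: int) -> str:
    """Classify the '.' at position i as TOKEN_INTERNAL or BOUNDARY."""
    before = text[i - 1] if i > 0 else " "
    after = text[i + 1] if i + 1 < len(text) else " "
    after2 = text[i + 2] if i + 2 < len(text) else " "
    if before.isdigit() and after.isdigit():
        return "TOKEN_INTERNAL"   # decimal point, as in "3.5"
    if after.isalpha():
        return "TOKEN_INTERNAL"   # word-internal dot, as in "U.S."
    if after == " " and after2.islower():
        return "TOKEN_INTERNAL"   # abbreviation dot followed by lowercase text
    return "BOUNDARY"             # default: sentence-final punctuation

if __name__ == "__main__":
    sample = "The U.S. dose was 3.5 mg. Next sentence."
    for pos, ch in enumerate(sample):
        if ch == ".":
            print(pos, classify_period(sample, pos))
```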
Journal title:
Volume / issue:
Pages: -
Publication date: 2006